
NVIDIA · Chat / LLM · 120B Parameters (12B Active) · 256K Context (up to 1M)

Function Calling · Streaming · Reasoning · Agent Workflows · Long Context · Code · Tool Use

Overview
NVIDIA Nemotron-3 Super 120B A12B FP8 is an open-weight LLM built for agentic reasoning and high-volume enterprise workloads. Using a hybrid LatentMoE architecture (Mamba-2 + MoE + Attention) with Multi-Token Prediction (MTP) and native NVFP4 pretraining on 25T tokens, it delivers up to 2.2x higher throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B. With a native 1M-token context window, configurable thinking mode, and 60.47% on SWE-Bench Verified, it is purpose-built for collaborative agents, long-context reasoning, and IT automation across 7 languages — served instantly via the Qubrid AI Serverless API.

⚡ 2.2x throughput vs GPT-OSS-120B. 1M token context. 512 experts, 22 active per token. Deploy on Qubrid AI — no H100 cluster required.
Model Specifications
| Field | Details |
|---|---|
| Model ID | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 |
| Provider | NVIDIA |
| Kind | Chat / LLM |
| Architecture | LatentMoE — Mamba-2 + MoE + Attention hybrid with MTP; 512 experts, 22 active per token; 120B total / 12B active |
| Parameters | 120B total (12B active per inference pass) |
| Context Length | 256K Tokens (up to 1M) |
| MoE | Yes (512 experts, 22 active per token) |
| Release Date | March 11, 2026 |
| License | NVIDIA Nemotron Open Model License |
| Training Data | 25T token corpus (NVFP4 native pretraining): web, code, math, science, multilingual; post-training cutoff February 2026; pre-training cutoff June 2025 |
| Function Calling | Supported |
| Image Support | N/A |
| Serverless API | Available |
| Fine-tuning | Coming Soon |
| On-demand | Coming Soon |
| State | 🟢 Ready |
Pricing
💳 Access via the Qubrid AI Serverless API with pay-per-token pricing. No infrastructure management required.
| Token Type | Price per 1M Tokens |
|---|---|
| Input Tokens | $0.10 |
| Input Tokens (Cached) | $0.04 |
| Output Tokens | $0.50 |
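As a worked example, a request with 100K fresh input tokens, 100K cached input tokens, and 2K output tokens costs 0.1 × $0.10 + 0.1 × $0.04 + 0.002 × $0.50 = $0.015.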
Quickstart
Prerequisites
- Create a free account at platform.qubrid.com
- Generate your API key from the API Keys section
- Replace `QUBRID_API_KEY` in the code below with your actual key
💡 Temperature & Top P: Use `temperature=1` and `top_p=0.95` — recommended for all tasks with this model.
Python
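The snippet below is a minimal Python sketch against the OpenAI-compatible API (see Why Qubrid AI?). The `base_url` shown is an assumption; confirm the exact endpoint in the Qubrid docs.

```python
# Minimal streaming chat completion sketch.
from openai import OpenAI

client = OpenAI(
    base_url="https://platform.qubrid.com/v1",  # assumed endpoint; check docs.platform.qubrid.com
    api_key="QUBRID_API_KEY",                   # replace with your key
)

stream = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    messages=[
        {"role": "user", "content": "What are the benefits of renewable energy?"},
    ],
    temperature=1,     # recommended for all tasks with this model
    top_p=0.95,        # recommended nucleus sampling value
    max_tokens=16000,
    stream=True,       # streaming is on by default per the parameter table
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```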
Live Example
Prompt: What are the benefits of renewable energy?
Response:
Playground Features
The Qubrid AI Playground lets you interact with Nemotron-3 Super 120B directly in your browser — no setup, no code, no cost to explore.

🧠 System Prompt
Define the model’s role, reasoning mode, and output constraints before the conversation begins. Particularly powerful for agentic pipelines, tool-use workflows, and structured enterprise tasks.

Set your system prompt once in the Qubrid Playground and it applies across every turn of the conversation.
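The same pattern works over the API by sending a `system` message ahead of the conversation. A minimal sketch, assuming the same endpoint as in the Quickstart and a hypothetical IT-triage role:

```python
from openai import OpenAI

client = OpenAI(base_url="https://platform.qubrid.com/v1", api_key="QUBRID_API_KEY")  # assumed endpoint

# Hypothetical IT-triage system prompt, for illustration only.
response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    messages=[
        {"role": "system", "content": (
            "You are an IT triage agent. Classify each ticket and reply as "
            "JSON with the fields: priority, category, suggested_action."
        )},
        {"role": "user", "content": 'Ticket: "Server keeps crashing every 12 hours." Priority?'},
    ],
    temperature=1,
    top_p=0.95,
)
print(response.choices[0].message.content)
```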
🎯 Few-Shot Examples
Guide the model’s output structure and reasoning depth with concrete examples — no fine-tuning required. Especially effective for structured outputs and multi-step agentic tasks.

| User Input | Assistant Response |
|---|---|
| Ticket: "Server keeps crashing every 12 hours." Priority? | Priority: HIGH. Category: Infrastructure Stability. Suggested action: Check system logs for OOM events, review cron jobs scheduled near crash window, and verify disk I/O health. |
| Summarize this 50-page policy document in 5 bullet points | • Scope: Applies to all employees handling customer PII. • Key requirement: Data must be encrypted at rest and in transit. • Breach protocol: Notify DPO within 72 hours. • Retention: 7-year maximum. • Non-compliance: Subject to disciplinary review. |
💡 Stack multiple few-shot examples in the Qubrid Playground to shape agentic behavior, output schema, and reasoning verbosity — no fine-tuning required.
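Over the API, few-shot examples stack as alternating `user`/`assistant` messages placed before the live input. A minimal sketch using the first example pair from the table above (the final ticket is hypothetical), with the same assumed endpoint:

```python
from openai import OpenAI

client = OpenAI(base_url="https://platform.qubrid.com/v1", api_key="QUBRID_API_KEY")  # assumed endpoint

# One few-shot pair (from the table above), then the live ticket.
messages = [
    {"role": "user", "content": 'Ticket: "Server keeps crashing every 12 hours." Priority?'},
    {"role": "assistant", "content": (
        "Priority: HIGH. Category: Infrastructure Stability. Suggested action: "
        "Check system logs for OOM events, review cron jobs scheduled near crash "
        "window, and verify disk I/O health."
    )},
    # Hypothetical live input, shaped by the example above.
    {"role": "user", "content": 'Ticket: "Printer on floor 3 is out of toner." Priority?'},
]

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    messages=messages,
    temperature=1,
    top_p=0.95,
)
print(response.choices[0].message.content)
```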
Inference Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| Streaming | boolean | true | Enable streaming responses for real-time output |
| Temperature | number | 1 | Controls randomness in output. Recommended: 1.0 for all tasks |
| Max Tokens | number | 16000 | Maximum tokens to generate |
| Top P | number | 0.95 | Controls nucleus sampling. Recommended: 0.95 for all tasks |
Use Cases
- Agentic workflows and multi-agent collaboration
- Long-context reasoning (up to 1M tokens)
- IT ticket automation and high-volume enterprise workloads
- Complex tool use and multi-step function calling (see the sketch after this list)
- RAG (Retrieval-Augmented Generation)
- Software engineering and cybersecurity triaging
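As a rough sketch of the function-calling flow, the example below registers a hypothetical `get_ticket_status` tool via the standard OpenAI-style `tools` parameter; the endpoint is the same assumed one as in the Quickstart:

```python
from openai import OpenAI

client = OpenAI(base_url="https://platform.qubrid.com/v1", api_key="QUBRID_API_KEY")  # assumed endpoint

# Hypothetical tool schema, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket_status",
        "description": "Look up the status of an IT ticket by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    messages=[{"role": "user", "content": "What is the status of ticket INC-4821?"}],
    tools=tools,
    temperature=1,
    top_p=0.95,
)

# If the model chose to call the tool, run it and send the result back in a
# follow-up "tool" message to complete the loop.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
```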
Strengths & Limitations
| Strengths | Limitations |
|---|---|
| LatentMoE: 512 experts / 22 active per token at same compute cost as standard MoE | Requires minimum 2× H100-80GB GPUs for local deployment |
| 2.2x throughput vs GPT-OSS-120B; 7.5x vs Qwen3.5-122B | Thinking mode adds latency overhead; low-effort mode recommended for simple queries |
| 60.47% SWE-Bench Verified; 83.73% MMLU-Pro; 79.23% GPQA | Not optimized for vision or multimodal inputs |
| Native 1M token context — 91.75% on RULER @ 1M | Function calling supported but may need prompt engineering for complex schemas |
| MTP speculative decoding: 3.45 avg acceptance length (up to 3x wall-clock speedup) | |
| Configurable reasoning mode via `enable_thinking=True/False` | |
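How `enable_thinking` is passed depends on the serving stack. One plausible sketch, assuming a vLLM-style `chat_template_kwargs` passthrough (not confirmed for this endpoint; check the Qubrid docs):

```python
from openai import OpenAI

client = OpenAI(base_url="https://platform.qubrid.com/v1", api_key="QUBRID_API_KEY")  # assumed endpoint

# Assumption: the server forwards chat_template_kwargs to the chat template,
# as vLLM-style OpenAI-compatible servers commonly do.
response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    messages=[{"role": "user", "content": "Summarize this ticket in one line."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # skip reasoning for simple queries
)
print(response.choices[0].message.content)
```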
Why Qubrid AI?
- 🚀 No infrastructure setup — 120B MoE served serverlessly, pay only for what you use
- 🔁 OpenAI-compatible — drop-in replacement using the same SDK, just swap the base URL
- 💰 Cached input pricing — $0.04/1M for cached tokens, critical for long-context and repeated RAG workloads
- ⚡ Throughput-optimized — Nemotron’s 2.2x speed advantage is fully realized on Qubrid’s low-latency infrastructure
- 🧪 Built-in Playground — prototype with system prompts and few-shot examples instantly at platform.qubrid.com
- 📊 Full observability — API logs and usage tracking built into the Qubrid dashboard
Resources
| Resource | Link |
|---|---|
| 📖 Qubrid Docs | docs.platform.qubrid.com |
| 🎮 Playground | Try Nemotron-3 Super 120B live |
| 🔑 API Keys | Get your API Key |
| 🤗 Hugging Face | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 |
| 💬 Discord | Join the Qubrid Community |
Built with ❤️ by Qubrid AI
Frontier models. Serverless infrastructure. Zero friction.